ABSTRACT

The subcellular location database for Arabidopsis
proteins (SUBA3, http://suba.plantenergy.uwa.edu.au)
combines manual literature curation of large-scale
subcellular proteomics, fluorescent protein
visualization and protein-protein interaction (PPI) 
datasets with subcellular targeting calls from 22 
prediction programs. More than 14500 new experi- 
mental locations have been added since its first 
release in 2007. Overall, nearly 650000 new calls of 
subcellular location for 35388 non-redundant 
Arabidopsis proteins are included (almost six times 
the information in the previous SUBA version). A 
re-designed interface makes the SUBA3 site more 
intuitive and easier to use than earlier versions and 
provides powerful options to search for PPIs within 
the context of cell compartmentation. SUBA3 also 
includes detailed localization information for refer- 
ence organelle datasets and incorporates green 
fluorescent protein (GFP) images for many proteins. 
To determine as objectively as possible where a 
particular protein is located, we have developed 
SUBAcon, a Bayesian approach that incorporates 
experimental localization and targeting prediction 
data to best estimate a protein's location in the 
cell. The probabilities of subcellular location for 
each protein are provided and displayed as a picto- 
graphic heat map of a plant cell in SUBA3. 

INTRODUCTION 

The sequencing of the genome of the model plant 
Arabidopsis thaliana (1) and the subsequent development 
of extensive tools and datasets for its genetic dissection 
(2,3) have provided scientists with foundational
information on the structure of model plant genomes
and their coding capacities. However, the function of 
most Arabidopsis proteins still remains to be resolved. A 
key step towards understanding the metabolic or biochem- 
ical role of any protein is to define its subcellular location. 
Proteins found in distinct subcellular compartments 
are part of interconnected metabolic and regulatory 
pathways, can share similar characteristics and collectively 
define the function of the particular compartment. 
Aggregating the evidence for where all the proteins of 
Arabidopsis are located in cells is thus an important foun- 
dation for interpreting the role of each of its genes (4). 

Both in silico prediction methods and experimental 
approaches are widely used by researchers to determine 
the subcellular location of proteins. Computational pre- 
diction programs use various machine-learning algorithms 
that identify sequence features from the primary protein 
sequence to predict the subcellular location of a protein. 
These bioinformatic programs have become increasingly 
important for annotating newly sequenced genes and for 
providing testable hypotheses regarding protein localization
and function (5). However, it is clearly preferable
to use experimental data on protein location where these
are available. Popular experimental approaches for
subcellular determination in Arabidopsis include in vitro 
protein import studies into isolated organelles, in vivo 
protein tagging by fluorescent markers and cell fraction- 
ation followed by protein detection using enzyme activity 
measurements, immunolocalization or mass spectrometry 
(6). Shotgun proteomic studies employing mass spectrom- 
etry to identify peptides in purified subcellular compart- 
ments result in large, information-rich datasets, whereas 
targeted fluorescent protein studies allow directed analysis 
of location and can provide clear evidence of multi- 
targeting to several locations. Unfortunately, most of 
these experimental data for Arabidopsis proteins are scat- 
tered in the literature and biologists can spend a significant 
amount of time and effort in searching for all the available 
localization information. Moreover, a large number of
protein localizations can be reported in an article but 
not listed in the title, abstract or text. Therefore, it is 
not always easy to access experimental localization data 
from literature sources. In addition, curated subcellular 
proteomes and catalogues of GFP targeting information 
are not readily available as defined datasets. 

A number of key databases have been developed to 
integrate localization data from different sources, such 
as the Plant Proteomics Database (PPDB) (2), 
AT_CHLORO (7) and ARAMEMNON (8).
ARAMEMNON, for example, was designed to overcome
the individual limitations of different types of predictors
by combining their predictions and including experimental
data as further evidence (8). Localization predictions are
also reported in PPDB (2) and AT_CHLORO (7), but
the assigned subcellular locations are based solely on
experimental evidence. Aggregators add value to the use of
individual predictors and are recommended when
investigating the subcellular location of a protein (9,10).

The SUBcellular localization database for Arabidopsis
proteins (SUBA) (4,11) brings together protein localization
information for Arabidopsis proteins provided by different
prediction algorithms as well as experimental data
and annotations. As a central hub for protein localization
in Arabidopsis, SUBA has provided access to defined
sets of localization data that have been collectively
investigated by the research community since its first
release in 2007. SUBA has been used extensively to define the
location of specific proteins in hundreds of reports and
also to assess targeting prediction programs (12,13),
identify the localization of protein families (4) and
assess metabolic network models (14,15). By expanding
the curated information in SUBA3, including more predictors
of targeting, incorporating protein-protein interaction
(PPI) data and developing SUBAcon, a Bayesian
approach to best estimate a protein's location in the cell,
we have increased the value and reliability of the database.

MATERIALS AND METHODS 

Database structure and interface 

SUBA3 utilizes the database programming language SQL 
(Structured Query Language) and is housed on a Linux 
server running Ubuntu 10.04 LTS. The SUBA3 web 
browser-based graphical user interface is written in 
Dynamic Hyper Text Markup Language that makes use 
of Asynchronous JavaScript and XML (AJAX) to interact 
with the SUBA server. The back-end of SUBA utilizes a 
number of PHP scripts that interact with the MySQL 
tables housing the SUBA data. Making use of complex 
JavaScript, the interface works best via the Mozilla 
Firefox, Google Chrome or Safari web browsers but will 
work on Microsoft Internet Explorer (6 and above). The 
use of JavaScript allows users to dynamically construct, 
via the interface, complex Boolean queries without the 
need to be proficient in SQL. Through the interface, 
SUBA3 can be easily queried to define subsets of 
proteins predicted or experimentally found to be located 
in different parts of the cell. SUBA3 leverages open-source
technologies in order to provide a freely available
platform at http://suba.plantenergy.uwa.edu.au. 
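
As an illustration of how menu selections could be turned into a Boolean SQL query without the user writing SQL, the following minimal sketch uses Python's built-in sqlite3 module. The table name `calls`, its columns, the helper `build_query` and the protein identifiers are all hypothetical; the actual SUBA3 MySQL schema and PHP back-end are not described here and may differ substantially.

```python
import sqlite3

# Hypothetical, simplified schema: one row per localization call.
# The real SUBA3 MySQL tables are not described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (agi TEXT, source TEXT, location TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("AT1G00001.1", "GFP", "mitochondrion"),  # made-up identifiers
        ("AT1G00001.1", "MS", "mitochondrion"),
        ("AT1G00002.1", "GFP", "mitochondrion"),
        ("AT1G00002.1", "MS", "plastid"),
        ("AT1G00003.1", "MS", "plastid"),
    ],
)


def build_query(include, exclude):
    """Translate menu selections into a parameterized SQL query.

    include and exclude are lists of (source, location) pairs. Every
    'include' pair must match (AND) and no 'exclude' pair may match (NOT).
    """
    clause = (
        "EXISTS (SELECT 1 FROM calls c WHERE c.agi = p.agi "
        "AND c.source = ? AND c.location = ?)"
    )
    parts = [clause for _ in include] + ["NOT " + clause for _ in exclude]
    sql = "SELECT DISTINCT p.agi FROM calls p"
    if parts:
        sql += " WHERE " + " AND ".join(parts)
    sql += " ORDER BY p.agi"
    # Parameters follow the same order as the clauses above.
    params = [value for pair in include + exclude for value in pair]
    return sql, params


# Proteins seen in mitochondria by GFP but NOT seen in plastids by MS.
sql, params = build_query(
    include=[("GFP", "mitochondrion")],
    exclude=[("MS", "plastid")],
)
hits = [row[0] for row in conn.execute(sql, params)]
print(hits)
```

Passing the selected values as query parameters rather than pasting them into the SQL string is a design choice of this sketch, not a statement about how the SUBA3 scripts work.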

Experimental data sources 

The non-redundant nuclear Arabidopsis protein set in 
SUBA3 was obtained from The Arabidopsis Information 
Resource (TAIR, release 10) (16). Arabidopsis mito- 
chondrial (117) and chloroplast (87) open reading frame 
(ORF) sets were obtained from GenBank Y08501 and 
AP000423, respectively. SUBA3 currently contains a 
total of 35 388 distinct proteins. Primary attributes for 
proteins such as molecular weight, average hydropathicity 
and isoelectric point as well as functional assignments for 
each Arabidopsis locus were generated as described by 
Heazlewood et al. (4). Experimental subcellular localiza- 
tions of proteins by mass spectrometry studies were 
obtained by searching PubMed (http://www.ncbi.nlm. 
nih.gov/pubmed) with 'proteomics' and 'Arabidopsis' or 
'MS' and 'Arabidopsis', whereas localizations of proteins 
by GFP tagging were obtained using the keyword 
'Arabidopsis' in combination with 'fluorescent protein', 
'GFP', 'CFP', 'YFP' or 'RFP'. Articles were read to de- 
termine whether Arabidopsis proteins were localized and 
the Arabidopsis Genome Initiative (AGI) identifiers with 
their localizations were extracted directly from the text or 
from supplementary data. Mass spectrometry-based local- 
izations were obtained from 122 publications and repre- 
sent 7685 unique proteins. Protein localizations based on 
GFP tagging studies were obtained from 1074 articles and 
represent 2477 unique proteins. The textual descriptions 
were interpreted to fit the 11 subcellular locations defined
in SUBA, along with a category of 'unclear' for those that 
could not be fitted to this structure. Additionally, location 
annotations from literature sources for Arabidopsis 
proteins add 262 758 entries from TAIR (16), Swiss-Prot 
(17) and AmiGO (18). PPI datasets of 12 080 protein pairs 
were obtained by searching the content of the IntAct 
database for interacting Arabidopsis proteins (19). In 
addition, 552 interacting PPI pairs were obtained by 
searching PubMed (http://www.ncbi.nlm.nih.gov/ 
pubmed) using the keywords 'Arabidopsis' in combin- 
ation with 'interact', 'interaction' or 'interacting'. The 
AGI identifiers of interacting Arabidopsis proteins were 
extracted directly from the text of the articles or from 
supplementary data. 

Subcellular location prediction 

Subcellular targeting predictions were carried out using 22 
different bioinformatic programs: AdaBoost (20), ATP 
(21), BaCelLo (22), ChloroP 1.1 (23), EpiLoc (24), 
iPSORT (25), MitoPred (26), MitoProt (27), MultiLoc2 
(28), Nucleo (29), PCLR 0.9 (30), Plant-mPLoc (31), 
PProwler 1.2 (32), Predotar v1.03 (33), PredSL (34),
PTS1 (35), SLPFA (36), SLP-Local (37), SubLoc (38), 
TargetP 1.1 (5), WoLF PSORT (39) and YLoc (40). 
Targeting predictions were carried out on the full-length 
protein sequences obtained from TAIR10 (16). 

RESULTS

SUBA curation, interface and the update of 
experimental data 

SUBA3 currently comprises 783 025 pieces of subcellular 
location information for a total of 35 388 non-redundant 
Arabidopsis proteins (Figure 1). Of these data, 38 059 are 
calls from experimental evidence curated from the litera- 
ture as MS/MS, GFP and now PPI data. At the time of 
writing, there are 22 191 entries based on subcellular 
proteomic studies, representing 7685 distinct proteins 
from 122 publications. Additional data from 1074 differ- 
ent publications add 3788 entries based on GFP tagging 
studies and comprise 2477 distinct proteins (Figure 1). 
Combined, the experimental data cover a total of 9024 
non-redundant proteins localized by mass spectrometry 
or GFP tagging studies of which 1138 proteins have 
been localized by both methods. PPI data include 12 080 
distinct protein pairs from 534 publications (Figure 1). 
Further annotation of location from literature sources 
for Arabidopsis proteins obtained through Swiss- 
Prot (17) and TAIR (16) contributes a similar number of 
localizations with 138 393 and 109 340, respectively, 
whereas AmiGO (18) contributes 15 025 localizations. 
SUBA3 includes the expansion of the number of pre- 
dictors from 10 to 22, making use of many new (and 
better) predictors published in the last 6 years. A total 
of 482 208 calls are by prediction algorithms. SUBA3 
can be queried via a web browser interface, accessible
via http://suba.plantenergy.uwa.edu.au (Figure 1). The
interface allows users to ask a simple question about one 
protein or, even with no prior knowledge of SQL, to con- 
struct moderately complex SQL queries using drop-down 
menus and buttons. The interface employs a tabbed design 
featuring 'Home', 'Search', 'Results' and 'Help' tabs.

The primary 'Search' tab involves pull-down menus and 
text boxes for the users' convenience that can also be used 
in combination with AND, OR, NOT and parentheses to 
build complex Boolean queries. Once a query has been 
submitted, the 'Results' page presents a table, which by 
default contains the AGI identifier, description and local- 
ization summary information from predictions, annota- 
tions, GFP, mass spectrometry and PPI data. Nearly all 
retrieved data are linked to a reference in PubMed (http:// 
www.ncbi.nlm.nih.gov/pubmed). Results can be sorted 
(ascending/descending) by field using the function menu. 
The function menu is activated by tracking the mouse over 
the column header and then selecting the emerging arrow. 
New columns can be added to the 'Results' tab window by 
selecting 'Columns' in the function menu and columns can 
be organized using drag and drop functionality. Thus, 
users are able to control which data columns are visible 
and the order in which they are displayed. If further 
analysis is desired, all results can be downloaded as a 
tab-delimited file by using the 'Download All Results' 
button. Each AGI identifier in the results page is hyperlinked
to a 'SUBA flatfile' that provides a variety of information
and helpful links. These include detailed
subcellular localization information and the capability to
include and display GFP images.

Figure 1. SUBA3 curation, calculations, classification and the interface for interrogation. Blue boxes highlight existing sections in SUBA that have
been significantly updated; red boxes highlight new sections added in SUBA3.
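
For readers who want to process a downloaded tab-delimited results file programmatically, a minimal sketch using Python's standard csv module is given here. The column names (`AGI`, `Description`, `GFP`, `MS`) and the two rows are invented for illustration; the header of a real SUBA3 export should be inspected first, since it may differ.

```python
import csv
import io

# Stand-in for a downloaded results file. The column names and rows are
# assumptions made for this example, not the documented export format.
downloaded = io.StringIO(
    "AGI\tDescription\tGFP\tMS\n"
    "AT1G00001.1\texample protein A\tmitochondrion\tmitochondrion\n"
    "AT1G00002.1\texample protein B\t\tplastid\n"
)

# Each row becomes a dict keyed by the header line.
rows = list(csv.DictReader(downloaded, delimiter="\t"))

# Keep proteins that have any GFP-based localization recorded.
with_gfp = [row["AGI"] for row in rows if row["GFP"]]
print(with_gfp)
```

With a real file, `io.StringIO(...)` would be replaced by `open(path, newline="")`.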

Selecting predictors for use for different subcellular 
compartments 

The large increase in the number of predictors integrated in
SUBA provides an opportunity to analyse their prediction
sensitivity and specificity across a range of subcellular
locations. Many of the algorithms that form the
basis of these predictors call plastid, mitochondria or
the secretory pathway. A smaller number predict peroxisome
and nuclear targeting, and some report a null prediction
as a cytosolic prediction. A different subset breaks down
predictions within the secretory pathway into
vacuole, Golgi, plasma membrane, endoplasmic reticulum
and extracellular environment. The coverage of 10
locations defined in SUBA by the various predictors is
illustrated in Figure 2.
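
The sensitivity and specificity of a single predictor for a single location can be computed against a set of proteins of known location, as in the sketch below. The function name, the data layout and the four toy proteins are assumptions made for illustration. The sketch also assumes one location per protein, whereas real reference data can contain dual or multiple locations.

```python
def sensitivity_specificity(predicted, reference, location):
    """Score one predictor for one location against a reference set.

    predicted and reference map protein identifiers to a location string.
    Only proteins present in both mappings are scored.
    """
    tp = fp = tn = fn = 0
    for protein, true_location in reference.items():
        if protein not in predicted:
            continue
        called = predicted[protein] == location
        actual = true_location == location
        if called and actual:
            tp += 1
        elif called:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    # Sensitivity: fraction of true members of the location that were called.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of non-members that were correctly not called.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data, invented for this example.
reference = {"p1": "plastid", "p2": "plastid", "p3": "mitochondrion", "p4": "nucleus"}
predicted = {"p1": "plastid", "p2": "mitochondrion", "p3": "mitochondrion", "p4": "plastid"}

sens, spec = sensitivity_specificity(predicted, reference, "plastid")
print(sens, spec)
```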

Combining experimental data and predictions 

Evaluating the large amount of data now available for 
many Arabidopsis proteins can be difficult for researchers 
not familiar with the experimental approaches or the pre- 
diction software. The limitations of these methods are 
seldom apparent to non-experts, often leading to overcon- 
fidence in the reported results. As more results accumu- 
late, so do conflicting data and predictions, making it 
increasingly hard to present a clear conclusion for 
SUBA users. To help reduce this confusion, SUBA now
presents a consensus location (SUBAcon) based on
Bayesian probabilities calculated from all the experimental 
data and predictions available for each protein (Figure 1). 
SUBAcon will be valuable to researchers unsure of 
how to evaluate the data themselves and also to re- 
searchers wishing to automate the evaluation of localiza- 
tion calls for genome-wide analyses (e.g. constructing 
compartmentalized metabolic networks). 

The development of SUBAcon and an assessment of its 
performance will be described elsewhere; in brief, two 
Bayesian classifiers have been integrated into SUBA 
using the 22 subcellular location prediction sets plus the 
SUBA3-curated GFP and mass spectrometry datasets as 
inputs into the models. The first classifier evaluates calls to 
plastid, mitochondrion, peroxisome, cytosol, nucleus and 
all calls for entry into the secretory pathway; the second 
classifier treats calls within the secretory pathway to the 
vacuole, Golgi, plasma membrane, endoplasmic reticulum 
and to the extracellular environment. Deriving the param- 
eters for the two naive Bayesian models requires 
estimating the accuracy of the location calls derived 
from each predictor or experimental approach. This was 
achieved using a protein 'reference set' (RS) compiled by 
manual analysis of TAIR10 annotation and MapMan (41) 
evaluation of biochemical pathways and functional 
groups. Locations in the RS are inferred by function, 
rather than by localization data alone and the set 
includes many proteins with dual or multiple locations. 
This continually improving RS comprises over 5000
proteins at the time of writing and can be investigated
through the SUBA3 search interface using the first row of
pull-down menus. To obtain the final probabilities for
proteins that enter the secretory pathway, the outputs of
the two Bayesian models are combined by multiplying the
probability values of locations in the 'secretory' model
with the probability value of a secretory pathway call
from the first model. The probability values of
SUBAcon can be viewed by tracking the mouse over the
subcellular compartments of the pictographic plant cell
heat map on the 'SUBA3 flatfile'.

Figure 2. Selecting predictors for use for different subcellular compartments. The outputs of 22 predictors of Arabidopsis protein location across 10
locations are employed in SUBA. The locations predicted by each predictor are shown in green. In total, 6 predictors provide calls for all 10 SUBA
locations and 16 predictors generate calls for a subset of locations.

Figure 3. Using PPI data to define extensions of subcellular proteomes.
(A) Mitochondria, (B) plastids and (C) peroxisomes. Blue is the experimentally
confirmed set by GFP or MS/MS; yellow are proteins that
interact with the experimental organelle subset. Novel interacting
proteins (subset of yellow) were analysed for those that were predicted
in another compartment (red), predicted in the same compartment
(green) or experimentally found in another compartment (grey).
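
The two-stage calculation described above can be sketched generically as follows. This is a textbook naive Bayes posterior followed by the multiplication step, not the SUBAcon implementation, whose parameters and details are described elsewhere. The function names, the location label `secretory` and all probability values are hypothetical.

```python
import math


def naive_bayes_posterior(priors, likelihoods, calls):
    """Return normalized P(location | calls) under a naive Bayes model.

    priors:      {location: P(location)}
    likelihoods: {source: {location: {call: P(call | location)}}}
    calls:       {source: call} observed for one protein
    """
    scores = {}
    for location, prior in priors.items():
        score = prior
        # Naive assumption: sources are independent given the location.
        for source, call in calls.items():
            score *= likelihoods[source][location][call]
        scores[location] = score
    total = sum(scores.values())
    return {location: score / total for location, score in scores.items()}


def combine(first_model, secretory_model):
    """Replace the 'secretory' entry of the first model by its breakdown."""
    combined = {loc: p for loc, p in first_model.items() if loc != "secretory"}
    p_secretory = first_model["secretory"]
    for location, p in secretory_model.items():
        # Final probability = P(secretory) * P(location | secretory).
        combined[location] = p_secretory * p
    return combined


# Stage 1 on toy numbers: one predictor, two candidate locations.
priors = {"plastid": 0.5, "mitochondrion": 0.5}
likelihoods = {
    "predictorA": {
        "plastid": {"plastid": 0.8, "mitochondrion": 0.2},
        "mitochondrion": {"plastid": 0.1, "mitochondrion": 0.9},
    }
}
posterior = naive_bayes_posterior(priors, likelihoods, {"predictorA": "plastid"})

# Stage 2 on toy numbers: split a secretory pathway call into sub-locations.
first = {"plastid": 0.1, "mitochondrion": 0.1, "secretory": 0.8}
within_secretory = {"vacuole": 0.25, "Golgi": 0.75}
final = combine(first, within_secretory)
print(posterior, final)
```

Because the second model's probabilities sum to one, the combined values still sum to one after the multiplication.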

PPI data as subcellular location tool 

Recently, large experimental PPI datasets for Arabidopsis 
proteins have been published (42,43), providing a new 
source of information that can be assessed for its utility 
to locate proteins within cells. By including these data in 
SUBA and allowing searches for proteins that are known 
to interact with a single protein or a subset of search 
proteins, we are able to use PPI data to extend experimentally
defined subcellular proteomes. For example, the
mitochondrial experimental proteome of 1017 overlaps
with 622 proteins in PPI pairs (Figure 3A), defining 478 
proteins that have been shown to interact with a protein 
experimentally located in mitochondria but which have 
not been experimentally located in mitochondria them- 
selves. In this set of 478 proteins, 233 have been located 
elsewhere by MS or GFP, 6 were clearly predicted to be 
elsewhere, whereas 239 were predicted to be located in 
mitochondria (Figure 3A). These 239 proteins are thus
predicted to be mitochondrially located and shown experimentally
to interact with proteins known experimentally
to be located in mitochondria, making this a strong set of
candidates to extend the mitochondrial proteome by
~20%. Similar analysis of plastids provided a set of 301
proteins (extending the experimental set by ~15%, 
Figure 3B), whereas in peroxisomes, this set was only 
nine proteins (extending the experimental set by ~3%, 
Figure 3C). Analysis of these sets of interactions shows
that the integration of PPI data can predict binding
partners for plastid and mitochondrial heat shock
proteins, thioredoxin/glutaredoxins and TPR/PPR proteins
and propose binding partners of unknown function for
peroxin (PEX) proteins in peroxisomes. These PPI
datasets of particular compartments can be rapidly 
generated by any user through the PPI text box below 
the '. . . protein does/does not interact with proteins(s) in 
list' menu row on the SUBA search interface and subse- 
quent analysis of SUBA results in Excel. Once the final set 
of interacting proteins is obtained, SUBA can be queried 
again via the PPI text box to obtain matched sets of inter- 
acting partners. 
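
The set logic behind this kind of analysis can be sketched outside the interface as follows. The function `extend_proteome`, its argument layout and the toy identifiers are assumptions for illustration, not the procedure used to generate Figure 3. In particular, the last group here lumps together proteins predicted elsewhere and proteins with no prediction.

```python
def extend_proteome(organelle_set, ppi_pairs, located_elsewhere,
                    predicted_location, organelle):
    """Partition novel interactors of an experimentally defined organelle set.

    organelle_set:      proteins experimentally located in the organelle
    ppi_pairs:          iterable of (protein_a, protein_b) interactions
    located_elsewhere:  proteins experimentally located in another compartment
    predicted_location: {protein: predicted location}
    """
    interactors = set()
    for a, b in ppi_pairs:
        if a in organelle_set:
            interactors.add(b)
        if b in organelle_set:
            interactors.add(a)
    # Interactors not themselves experimentally located in the organelle.
    novel = interactors - organelle_set
    elsewhere_exp = novel & located_elsewhere
    remaining = novel - elsewhere_exp
    # Candidates: interact with the organelle set and are predicted there.
    same_pred = {p for p in remaining if predicted_location.get(p) == organelle}
    # Predicted elsewhere, or no prediction available.
    other_pred = remaining - same_pred
    return novel, elsewhere_exp, same_pred, other_pred


# Toy data, invented for this example.
mito = {"m1", "m2"}
pairs = [("m1", "m2"), ("m1", "x1"), ("x2", "m2"), ("x3", "m1"), ("y1", "y2")]
elsewhere = {"x1"}
predictions = {"x2": "mitochondrion", "x3": "plastid"}

novel, exp_other, candidates, pred_other = extend_proteome(
    mito, pairs, elsewhere, predictions, "mitochondrion"
)
print(sorted(candidates))
```

The three groups returned after `novel` are disjoint and together make up `novel`, in the same way that the 233, 6 and 239 mitochondrial proteins reported above account for all 478 novel interactors.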



CONCLUSION 

Through the combination of wider literature curation, ag- 
gregation of predictor calls and integration through the 
development of SUBAcon, we have significantly 
extended the richest online aggregation of information 
on subcellular location of proteins in Arabidopsis. The 
SUBA3 search interface allows simple inquiries about
single proteins, as well as very complex queries across 
these datasets to build subcellular proteomes, compare 
the performance of different techniques and assess the 
location of user-defined sets of proteins. Integration of 
PPI data allows researchers for the first time to easily 
explore the value of PPI in extending subcellular prote- 
omes of interest. The development of SUBAcon also 
provides a single probabilistic call of location for all 
Arabidopsis proteins that will aid system-level studies in 
Arabidopsis and will continue to improve over time as 
new experimental data are added to the database. 